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Abstract 

In this paper we describe active set type algorithms for minimization of a smooth function 
under general order constraints, an important case being functions on the set of bimonotone 
r X s matrices. These algorithms can be used, for instance, to estimate a bimonotone re- 
gression function via least squares or (a smooth approximation of) least absolute deviations. 
Another application is shrinkage estimation in image denoising or, more generally, regression 
problems with two ordinal factors after representing the data in a suitable basis which is in- 
dexed by pairs e {1, . . . , r} x {1, . . . , s}. Various numerical examples illustrate our 
methods. 
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1 Introduction 

Monotonicity and other qualitative constraints play an important role in contemporary nonpara- 
metric statistics. One reason for this success is that such constraints are often plausible or even 
justified theoretically, within an appropriate mathematical formulation of the appUcation. More- 
over, by imposing shape constraints one can often avoid more traditional smoothness assumptions 
which typically lead to procedures requiring the choice of some tuning parameter. A good starting 
point for statistical inference under qualitative consttaints is the monograph by Robertson et al. 
[10]. 

Estimation under order constraints leads often to the following optimization problem: For 
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some dimension p > 2 let Q : W ^ Rhe a given functional. For instance, 

p 

(1) Qi9) = ^wuiZu-euf 

u=l 

with a certain weight vector w G (0, oo)^ and a given data vector Z G W. In general we assume 
that Q is continuously differentiable, strictly convex and coercive, i.e. 

Q{0) — oo as \\6\\ — oo, 

where || • || is some norm on MP. The goal is to minimize Q over the following subset K of M^: 
Let Che a given collection of pairs {u, v) of different indices u,v e {1,2, . . . ,p}, and define 

K = K{C) = {e eRP ■.eu<e^ for all {u, v)eC}. 

This defines a closed convex cone in MP containing all constant vectors. 

For instance, if C consists of (1, 2), (2, 3), . . . , {p — l,p), then IC is the cone of all vectors 
such that 6i < 62 < • • • < Op. Minimizing (1) over all such vectors is a standard problem 
and can be solved in 0{p) steps via the pool-adjacent-violators algorithm (PAVA). The latter was 
introduced in a special setting by Ayer et al. [1] and extended later by numerous authors, see [10] 
and Best and Chakravarti [3]. 

As soon as Q(-) is not of type (1) or C differs from the aforementioned standard example, the 
minimization oiQ{-) over K becomes more involved. Here is another example for K and C which 
is of primary interest in the present paper: Let p = rs with integers r, s > 2, and identify W with 
the set M*"^* of all matrices with r rows and s columns. Further let Kr^s be the set of all matrices 
eeW''^ such that 

< whenever i < r and 9ij < Oij+i whenever j < s. 

This corresponds to the set Cr,s of all pairs [{i,j),{k,l)) with i,k G {l,...,r} and j,£ G 
{1, . . . ,s} such that either {k,£) = {i + or {k,£) = + 1). Hence there are #C = 
2rs — r ~ s constraints. 

Minimizing the special functional (1), i.e. Q{6) = J2i j '^iji^ij ~ ^ij)^' over the bimonotone 
cone Kj. s is a well recognized problem with various proposed solutions, see, for instance, Spouge 
et al. [11], Burdakow et al. [4], and the references cited therein. However, all these algorithms 
exploit the special structure of Kr^s or (1). For general functionals Q( ), e.g. quadratic functions 
with positive definite but non-diagonal hessian matrix, different approaches are needed. 
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The remainder of this paper is organized as follows. In Section 2 we describe the bimonotone 
regression problem and argue that the special structure (1) is sometimes too restrictive even in 
that context. In Section 3 we derive possible algorithms for the general optimization problem 
described above. These algorithms involve a discrete optimization step which gives rise to a 
dynamic program in case of IK = Kr^s- For a general introduction to dynamic programming see 
Cormen et al. [6]. Other ingredients are active methods as described by, for instance, Fletcher [9], 
Best and Chakravarti [3] or Diimbgen et al. [8], sometimes combined with the ordinary PAVA in 
a particular fashion. It will be shown that all these algorithms find the exact solution in finitely 
many steps, at least when Q{-) is an arbitrary quadratic and strictly convex function. Finally, in 
Section 4 we adapt our procedure to image denoising via bimonotone shrinkage of generalized 
Fourier coefficients. The statistical method in this section was already indicated in Beran and 
Diimbgen [2] but has not been implemented yet, for lack of an efficient computational algorithm. 

2 Least squares estimation of bimonotone regression functions 

Suppose that one observes {x^,y^,Z^), {x^,y^,Z^), with real components x*, y* 
and Z*. The points (x*, y*) are regarded as fixed points, which is always possible by conditioning, 
while 

Z' = ii{x\y')+e' 

for an unknown regression function /x : M x M — >^ M and independent random errors e^, e^, . . . , 
with mean zero. In some appUcations it is plausible to assume ^ to be bimonotone increasing, i.e. 
non-decreasing in both arguments. Then it would be desirable to estimate ji under that constraint 
only. One possibility would be to minimize 

n 

t=l 

over all bimonotone functions /x. The resulting minimizer jl is uniquely defined on the finite set of 
all design points (x*, y*), 1 <t <n. 

For a more detailed discussion, suppose that we want to estimate on a finite rectangular grid 

■'^<'i<r,'\.<j <s}, 

where x^i) < X(2) < • • • < X(^r) 2/(1) < 2/(2) < • • • < y(s) contain at least the different 
elements of {x^, x^, . . . , x"} and {y^,y^, ■ ■ ■ , y"}, respectively, but maybe additional points as 
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well. For 1 < i < r and 1 < j < s let Wij be the number of alH € {1, . . . , n} such that (.x*, y*) = 
and let Zij be the average of over these indices t. Then Ylt=ii^^ ~ 

equals 

where = {Oij)i,j stands for the matrix G K^^s. 

Setting 1: Complete layout. Suppose that wij > for all (z, j) e {1, . . . , r} x {1, . . . , s}. 
Then the resulting optimization problem is precisely the one described in the introduction. 

Setting 2a: Incomplete layout and simple interpolation/extrapolation. Suppose that the set 
U of all index pairs (i, j) with Wij > differs from {1, . . . , r} x {1, . . . , s}. Then 

Q{e) = J2 MZu - 0u? 

ueu 

fails to be coercive. Nevertheless it can be minimized over Kj.^s with the algorithms described 
later. Let be such a minimizer. Since it is uniquely defined on U only, we propose to replace it 
withe = 2-1(0 + 0), where 

= max(^{ei>f : eU,i' < i, f < j} U {Omin}) , 

dij = mm(^{ei'f : £U,i< < j'} U {^max}), 

and ^min and ^max denote the minimum and maximum, respectively, of {9u : u eU}. Note that 
and belong to Kr^s and are extremal in the sense that any matrix 9 G Kr^s H [^min, ^max]'^^'' 
with 9u = 6u for allu satisfies necessarily < 0^ < 9ij for all (i, j). 

Setting 2b: Incomplete layout and light regularization. Instead of restricting one's attention 
to the index set U, one can estimate the full matrix {lJ-{x(i),y(j)))j^j € W^^ by minimizing a 
suitably penalized sum of squares, 

over Kr,s for some small parameter A > 0. Here P{-) is a convex quadratic function on M''^'' 
such that Q{-) is strictly convex. One possibility would be Tychonov regularisation with P{9) = 
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X^i ji^ij ~ and a certain reference value 9o, for instance, 60 = j WijZij/ Y- ■ Wij. In our 
particular setting we prefer the penalty 

(2) P{e) = Yl {Gu-Oijf, 

because it yields smoother interpolations than the recipe for Setting 2a or the Tychonov penalty. 
One can easily show that the resulting quadratic function Q is strictly convex but with non-diagonal 
hessian matrix. Thus it fulfills our general requirements but is not of type (1). 

Note that adding a penalty term such as (2) could be worthwhile even in case of a complete 
layout if the underlying function fi is assumed to be smooth. But this leads to the nontrivial task 
of choosing A > appropriately. Here we use the penalty term mainly for smooth interpola- 
tion/extrapolation with A just large enough to ensure a well-conditioned Hessian matrix. We refer 
to this as "light regularization", and the exact value of A is essentially irrelevant. 

Example 2.1 To illustrate the difference between simple interpolation/extrapolation and light 
regularization with penalty (2) we consider just two observations, {x^,y^,Z^) = (2,3,0) and 
{x^, y"^, Z^) = (6, 7, 1), and let r = 7, s = 10 with = i and y(^j^ = j. Thus Wij = except 
for W2,3 = W6J = 1, while ^2,3 = and Zqj = 1. Any minimizer 6 of JZueU '^u{Zu - du)'^ 
over IKy^io satisfies ^2,3 = and 9qj = 1, so the recipe for Setting 2a yields 

'0 ifi<2,j<3, 
- if^>6,i>7, 
0.5 else. 

The left panel of Figure 1 shows the latter fit 0, while the right panel shows the regularized fit 
based on (2) with A = 10~^. In these and most subsequent pictures we use a gray scale from 
black = to white = 1. 



Example 2.2 (Binary regression). We generated a random matrix Z G {0,1}''^** with r = 70 
rows, s = 100 columns and independent components Zij, where 

p/7 IX n + , l{yo-) > l/2 + cos(7rx(,))/4} 
f[Zij = 1) = 9ij = \ 

with = (^ — 0.5) /r and = (j — 0.5) /s. Thereafter we removed randomly all but 700 
of the 7000 components Zij. The resulting data are depicted in the upper left panel of Figure 2, 
where missing values are depicted grey, while the upper right panel shows the true signal 0. The 
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Figure 1: Simple interpolation/extrapolation versus light regularization 



lower panels depict the least squares estimator with simple interpolation/extrapolation (left) and 
light regularization based on (2) with A = 10^ ^ (right). Note that both estimators are very similar. 
Due to the small value of A, the main differences occur in regions without data points. 

The quality of an estimator 6 for may be quantified by the average absolute deviation, 

i=i i=i 

For the estimator with simple interpolation/extrapolation, AAD turned out to be 7.5607 • 10^^, the 
estimator based on light regularization performed sUghtly better with AAD = 7.4039 • 10~^. 

3 The general algorithmic problem 

We return to the general framework introduced in the beginning with a continuously differentiable, 
strictly convex and coercive functional Q : MP ^ M. and a closed convex cone K = K{C) G W 
determined by a collection C of inequality constraints. 

Before starting with explicit algorithms, let us characterize the point 

= argmin Q{6). 

em. 

It is well-known from convex analysis that a point G K coincides with 6 if, and only if, 

(3) VQ{eye = < VQ{eyr} for all rjeK, 

where VQ{6) denotes the gradient of Q at 0. This characterization involves infinitely many 
inequalities, but it can be replaced with a criterion involving only finitely many constraints. 



data 



true signal 




Figure 2: Binary regression with incomplete layout 
3.1 Extremal directions of K 

Note that K contains all constant vectors cl, c G M, where 1 = Ip = (l)f=i. It can be represented 
as follows: 

Lemma 3.1 Define 

£ = Kn{0,l}P. 
Then any vector x eK may be represented as 

X = min(£c)l + ^ AgC 

eee 

with coefficients Ae > such that J2ee£ ~ max(a;) — min(a;). 



Here min(a;) and max(a;) denote the minimum and maximum, respectively, of the compo- 
nents of X. 



Modified characterization of 6. By means of Lemma 3.1 one can easily verify that (3) is equiv- 
alent to the following condition: 

(4) Vg(6>)^6> = < VQ(6>)^e for all e G U {-1}. 

Thus we have to check only finitely many constraints. Note, however, that the cardinality of £ 
may be substantially larger than the dimension p, so that checking (4) is far from trivial. 

Application to 'K.r,s- Applying Lemma 3.1 to the cone 'K.r,s C W^^ yields the following repre- 
sentation: With 

£r,s = Kr,s n {0, Ij'-x^ 

any matrix £c G IK may be written as 

X = QqIj-xs "I" ^ ^ Ae6 

with coefficients Oo G K and Ag > 0, e G £r,s- 

There is a one-to-one correspondence between the set £r,s and the set of all vectors e G 
{1, 2, . . . , r + sY with components ei < 62 < • • • < via the mapping 

s 

Since such a vector e corresponds to a subset of{l,2,...,r + s} with r elements, we end up with 

Hence the cardinality of £r,s grows exponentially in min(r, s). Nevertheless, minimizing a linear 
functional over fr,s is possible in 0{rs) steps, as explained in the next section. 

Proof of Lemma 3.1. For a; G IK let ao < ai < • • • < Om be the different elements of 
{xi,X2, . . . , Xp}, i.e. ao = min(a;) and am = max(a;). Then 

m 

X = apl + y^^{ai - ai-i){l{xt > aiPLv 

i=l 

Obviously, these weights — Oj-i are nonnegative and sum to max(a;) — min(a;). Furthermore, 
one can easily deduce from a; G IK that {l{xt > belongs to £ for any real threshold a. □ 
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3.2 A dynamic program for Sr,s 

For some matrix a G M''^* let L : — > M be given by 

r s 
i=l j=l 

The minimum of L( ) over the finite set 6r,s may be obtained by means of the following recursion: 
For 1 < A; < r and 1 < ^ < s define 

r s 



i=k j=l 

r s 

H{k, s+1) = minj^ ^ aijCij : e e Sr,sj- 



Then 



and 



i=k j=l 



mill L{e) = H{l,s+ 1), 



i=k j=l 



s 

H{k,e+1) = min(^H{k,£), ^ + H{k + 1,1 + ifj 

j=e+i 

where we use the conventions that H{k+1,-) = and Ylj=s+i ' = 0- In the recursion formula for 
^+1), the term aij + H{k + l,£+l) is the minimum of Lfc(e) = Yll=k ^j=i "y^^i 

over all matrices e G Sr.s with e^e = and e^^^+i = 1 (if ^ < s), while H{k,£) is the minimum 
of Lk{e) over all e G £k,s with eke = I. 

Table 1 provides pseudocode for an algorithm that determines a minimizer of L( ) over 5^,5- 

3.3 Active set type algorithms 

Throughout this exposition we assume that minimization of Q over an affine linear subspace of 
MP is feasible. This is certainly the case if Q is a quadratic functional. If Q is twice continuously 
differentiable with positive definite Hessian matrix everywhere, this minimization problem can be 
solved with arbitrarily high accuracy by a Newton type algorithm. 

All algorithms described in this paper alternate between two basic procedures which are 
described next. In both procedures G IK is replaced with a vector ©new € IK such that 

Q{Onew) < Q{0) unless ©new = 0- 



Algorithm e ^ DynamicProgram(a) 

^ ^ (X]i=£ "'*:.i)fc<r/<s+l 
■'^ ^ (^)fe<r+l/<s+l 

for A; <— r downto 1 do 

for £ ^ 1 to s do 

Hk,e+i ^ min(-fffc/, + 
end for 
end for 

^ ^ (^)fc<r,£<s 

while A; < r and ^ > 1 do 

if Hk,e+i = Hk,e then 

else 

end if 
end while. 



Table 1 : Minimizing a linear functional over f ^ 



Basic procedure 1: Checking optimality of G K 



Suppose that G IK satisfies already the the following two equations: 



(5) 



vQ{eye = = vQ(6>)^i. 



According to (3), this vector is already the solution 6 if, and only if, VQ(0)^e > for all e E £. 
Thus we determine 

A e argmin VQ{Oye 

and do the following: If VQ{0)~^ A > 0, we know that = and stop the algorithm. Otherwise 
we determine 

to = argmin Q{0 + tA) > 



and replace with 



^new ■— + toA. 



This vector Onew lies in the cone IK, too, and satisfies the inequality Q{Onew) < Q{0)- Then we 
proceed with basic procedure 2. 
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Basic procedure 2: Replacing 6 witli a "locaDy optimal" point ©new € K 
The general idea of basic procedure 2 is to find a point ©new G IK such that 
(6) ©new = argmin Q{x) 

a:eV 

for some V in a finite family V of linear subspaces of W. Typically these subspaces V are ob- 
tained by replacing some inequality constraints from C with equaUty constraints and ignoring the 
remaining ones. This approach is described below as basic procedure 2a. But we shall see that it 
is potentially useful to modify this strategy; see basic procedures 2b and 2c. 

Basic procedure 2a: The classical active set approach. For © G IK define 

V(©) = {x eMP : Xu = Xy for all {u, v) e C with 6'„ = 9^}. 

This is a linear subspace of W containing 1 and © which is determined by those constraints from 
C which are "active" in ©. It has the additional property that for any vector x G V(©), 

A(©, x) = max{t G [0, 1] : (1 - i)© + G K} > 0. 

Precisely, A(©, a;) = 1 if £c G K, and otherwise, 

a a 

X{0,x) = min 



{u,v)eC : Xu>Xv On — du — Xx, -\- Xu 

The key step in basic procedure 2a is to determine Xq = argmin^gy^^) Q{x) and A(©, Xo). 
If Xo G K, which is equivalent to A(©, Xo) = 1, we are done and return ©new = Xg. This 
vector satisfies (6) with V = V(©) and V = V(©new)- The latter fact follows simply from 
V(©new) C V(©). If Xo IK, we repeat this key step with ©new = (1 ~ Xo)Q + A(©, Xo)xo 
in place of ©. 

In both cases the key step yields a vector ©new satisfying (5(©ncw) < Q(©)» unless Xo = 6. 
Moreover, if Xo IK, then the vector space V(©ncw) is contained in V(©) with strictly smaller 
dimension, because at least one additional constraint from C becomes active. Hence after finitely 
many repetitions of the key step, we end up with a vector ©new satisfying (6) with V = V(©new)- 
Table 2 provides pseudocode for basic procedure 2a. 

Basic procedure 2b: Working with complete orders. The determination and handling of the 
subspace V(©) in basic procedure 2a may be rather involved, in particular, when the set C consists 
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Algorithm 0new BasicProcedure2a(0) 

^new ^ ^ 

Xo ^ argmm^£y(0^^^) Q{x) 

A ^ A(0ncw) 3^0) 

while A < 1 do 

^new ■* — (1 ~ A)0ncw + AXq 

aJo ^ argmm^gv(0ncw) 

A <— A(0newj 3^0) 

end while 



Table 2: Basic procedure 2a 

of more than p constraints. One possibility to avoid this is to replace V(^) and K in the key step 
with the following subspace V*(0) and cone W{0), respectively: 

K*{e) = {xeRP ■.forallu,ve{l,...,p}, Xu<xvif9u<0v}- 

Note that 1,6 G K*{e) C V*(6'), and one easily verifies that K*{e) c IK if 6* G K. Basic 
procedure 2b works precisely like basic procedure 2a, but with V*(-) in place of V(-), and X{0, x) 
is replaced with 

X*{e,x) = ma^{te[0,l]:{l-t)e + txeK*{e)}. 

Then basic procedure 2b yields a vector Onew satisfying (6) with V = V*(0new)- 

When implementing this procedure, it is useful to determine a permutation a{-) of {1, . . . ,p} 
such that ^0^(1) < 0^(2) < • • • < ^(t(p)- Let 1 < ii < 12 < ■ ■ ■ < iq = p denote those indices i 
such that ^o^(j) < ifi < p. Then, with io = 0, 

Y*{0) = [x e MP : for 1 < i < q, is constant ini e {ie-i + 1, . . . ,ie}}, 

K*ie) = {xeY*ie):forl<e<q, < 



and 



x*{e,x) = 



mm 



Basic procedure 2c: A shortcut via the PAVA. In the special case of Q{9) being the weighted 
least squares functional in (1), one can determine 



6>. 



argmin Q{x) 
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directly by means of the PAVA with a suitable modification for the equality constraints defining 

Y*{e). 

The whole algorithm and its validity 

All subspaces Y{G) and Y*{6), 6 £ K, correspond to partitions of {1,2, ... ,p} into index sets. 
Namely, the linear subspace corresponding to such a partition consists of all vectors x £ W with 
the property that Xu = Xy for arbitrary indices u, v belonging to the same set from the partition. 
Thus the subspaces used in basic procedures 2a-b belong to a finite family V of Unear subspaces 
of all containing 1. 

We may start the algorithm with initial point 

argminQ(tl) ) • 1. 
Now suppose that Q^^\ . . . ,0^^^ G K have been chosen such that 

0W = argmin Q(x) for 1 < ^ < A; 

with linear spaces 

V(o)^___^V(fe) g y_ Then 6> = d'^^^ satisfies (5), and we may apply basic 
procedure 1 to check whether d^^^ = 0. If not, we may also apply a variant of basic procedure 2 
to get 6>('=+^) G K minimizing Q on a linear subspace V('=+^) G V, where Q(6>(*^+^)) < Q{e'^^'>). 
Since V is finite, we will obtain 6 after finitely many steps. 

Similar arguments show that our algorithm based on basic procedure 2c reaches an optimum 
after finitely many steps, too. 

Final remark on coercivity. As mentioned for Setting 2a, the algorithm above may be applica- 
ble even in situations when the functional Q fails to be coercive. In fact, we only need to assume 
that Q attains a minimum, possibly non-unique, over any linear space V(0), Y*{6) or any cone 
and we have to able to compute it. In Setting 2a, one can verify this easily. 

4 Shrinkage estimation 

We consider a regression setting as in Section 2, this time with Gaussian errors £* ~ AA(0, a"^). As 
before, the regression function : M x M — > M is reduced to a matrix 

M=(Mx»,yo))),,.eM'-- 

13 



for given design points < X(2) < • • • < x^^) and y^i) < 2/(2) < • • • < y(s)- This matrix is 
no longer assumed to be bimonotone, but the latter constraint will play a role in our estimation 
method. 

4.1 Transforming the signal 

At first we represent the signal M with respect to a certain basis of R^'^'^. To this end let U = 
[ui U2 . . . Ur] and V = [viV2 ■ . . Vg] be orthonormal matrices in W^^ and M*^^, respectively, 
to be specified later. Then we write 

M = UMV^ = ^MijUivJ with M = MV = {ujMVj)... 

Thus M contains the coefficients of M with respect to the new basis matrices u^vj G W^^. 
The purpose of such a transformation is to obtain a transformed signal M with many coefficients 
being equal or at least close to zero. 

One particular construction of such basis matrices U and V is via discrete smoothing splines: 
For given degrees k,£> 1, consider annihilators 



Oil • • • ai,fe+i 

022 • • • a2,fe+2 



B 





hi 



■ ■ ■ bi,e+i 

^22 • • • ^'2,^+2 





O) — k,r 



bs-e,! 



{r—k)xr 



{s-e)xs 



with unit row vectors such that 



^(4))I=i = fore = 0,...,A;-l, 
^(2/(i))-=i = fore = 0,...,^-l. 



An important special case is A; = ^ = 1. Here 



1 

71 



1 -1 
1 







and B 



1 

71 



1 -i_ 

satisfy the equations Air = and Big = 0. 



1 -1 
1 
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Next we determine singular value decompositions of A and B, namely, 



B 



U ■ [0 
V ■ [0 



{r—k}xk ^ 



^ diag(ai, . . .,ar-k)] ■ U 



{s-e)xe ^ 



< ai < ••• < a^_^ 

diag(6i, . . .,bs-i)] ■ V 



< 6i < ••• < bs-e 



with column-orthonormal matrices U,U = [wi ti2 • • • Ur], V and V = [viV2 ■ ■ ■ Vg]. The 
vectors ui,. . . ,Uk and vi,. . . ,Vi correspond to the space of polynomials of order at most k and 
£, respectively. In particular, we always choose ui = r~^/^lr and vi = Then 



M = 



Mil '^I'^i 



1=2 



'J 



(constant part) 
(additive part) 

(interactions) 



One may also write 



M 



U 



polynomial part 

kxi 


half-polyn. interactions 

kx{s-i) 


half-polyn. interactions 

{r-k)x£ 


non-polyn. interactions 

{r-k)x{s- t) 



For moderately smooth functions jj, we expect \Mij\ to have a decreasing trend vai > k and in 
j > £. This motivates a class of shrinkage estimators which we describe next. 

4.2 Shrinkage estimation in the simple balanced case 

In the case of n = p = rs observations such that each grid point (x^j), y^j-j) is contained in 
{{x^,y^), . . . , (x" , y") } , our input data may be written as a matrix 

Z = M + e 

with £ G M''^* having independent components Sij ~ AA(0,fj^). Reexpressing such data with 
respect to the discrete spline basis leads to Z = M+e with Z := ZV and e := sV. Note 
that the raw data Z is the maximum likelihood estimator of M. To benefit from the bias-variance 
trade-off, we consider component-wise shrinkage of the coefficient matrix Z: For 7 G [0, l]*"^* 
we consider the candidate estimator 



(7) 
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Eventually we will choose a shrinkage matrix 7 depending on the data and compute the shrinkage 
estimator 



(8) 



M = M 



Let II A Hi? denote the Frobenius norm of a matrix A, i.e. || A 




trace(A^A). 



As a measure of risk of the estimator (7), we consider 



R{l,M) = E||M^^ -M||J 




MB 




Here we used the fact that the transformed error matrix i has the same distribution as £. An 
estimator of this risk is given by 



where a is a certain estimator of a, e.g. based on high frequency components of Z, see later. 

Thus optimal shrinkage factors would be given by 7^ = M^j/ {M^j + a^), but these depend 
on the unknown signal M. Naive estimators would be jij = {1 — a"^ / Zf-)^ . The resulting 
estimator's performance is rather poor, but it improves substantially if 7 in (8) is given by 



with T close to 2; cf. Donoho and Johnstone [7]. 

An alternative strategy, utilized for instance by Beran and Diimbgen [2], is to restrict 7 to a 
certain convex set of shrinkage matrices serving as a caricature of the optimal 7. The previous 
considerations suggest to restrict —7 to be contained in K^f/^ the set of all matrices G W^^ 
such that 

• di^j = 02 J = • • • = 9k J is non-decreasing in j > £, 

• = &i,2 = • • • = 9i/ is non-decreasing in i > A:, 

• {dij)i>k,j>l belongs to Kr^k,s-l- 




(9) 
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The set of all such shrinkage matrices 7 is denoted by G^f/'* = (— K^^^/^) n [0, 1]'"^'*. Thus 
we propose to use the shrinkage matrix 

(10) 7 = argmin -^(7). 

In the present setting one can show (cf. [2]) that 

7 = argmin i?(7, M") = ( — — — ^) 



with f] = — argmin ^(—(Mj^ -|- cr^) — ^ij)' 



0eK$.^/* 1,3 



Similarly, 



7 = argmin i?(7) = ((1 - (tV%)"^)^ 



r,s 



with T] = — argmin — OjjY 

This allows one to experiment with different values for a with little effort. 



Estimation of the noise level. Two particular estimators are given by 

. / Y.i/r+j/s>K^ij Y^'^ . Median{\Zij\ : i/r + j/s > k) 



Mihj):i/r+j/s>K}J $-1(3/4) 
for a certain number k G (0, 2), where denotes the standard Gaussian quantile function. The 
idea is that f or i > > 1 and j » 1 , the components Zij are essentially equal to the noise variables 
iij ~ M{0, cr^). Otherwise both estimators tend to overestimate a. 

As to the choice of k, we propose to choose it via visual inspection of the graphs of k<-^ ai^^ 
and K a2,K- Typically these functions are almost constant and close to cr on a large subinterval 
of (0, 2), non-increasing to the left of that interval, and show random fluctuations to the right. As 
we shall illustrate later, the quality of the shrinkage estimator is rather robust with respect to the 
estimator a. In particular, overestimating a shghtly is typically harmless or even beneficial. 

Consistency. We now augment the foregoing discussion with consistency results that follow 
from more general considerations in [2]. First of all, for large p, the normalized quadratic loss 
II JVf^''^^— M|||. of a candidate estimator is close to its normahzed riskp~ii?(7, M), uniformly 
over 7 G Gr^/^ . Precisely, 

E sup |p-^||M(^^-M||^-p-ii?(7,M)| < + 



max(r, s)-*-/^ 
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with C denoting a generic universal constant. Moreover, if the variance estimator is Li- 
consistent, the normaUzed estimated risk ^"^^(7) differs little from the normaUzed true risk 
p-^i?(7, M), uniformly in 7 e Qff\ Namely, 

E sup \p-^R{^)-p-^R{-i,M)\ < C ^' + ^^"'^'''^"^ +CE|a^-a^|. 
-yeGff max(r,s)V2 

In particular, the shrinkage matrix 7 in (10) and the corresponding estimator M = M satisfy 
the inequalities 

^\p-^R{l)-p-^RnAn{M)\ \ ^a^+ap-y^M\\F ^^,.2 21 

r — ^ ; ^ttt h u E o" — cr , 

E|p-^||M-M|||-p-ii?min(M)| J max(r,s)V2 

where i?inin(-^) denotes the minimum of -R(7, M) over all 7 G G^f/^. 

Example 4.1 We generated a random matrix Z G W^^ with r = 60 rows, s = 100 columns and 
independent components Zjj ~ A/"(ju(x(j), l), where = (z — 0.5)/r, y(j) = (j — 0.5)/s, 
and 

/x(ar,?/) = 2r(x,y)-°-2^sin(T(x,y)) + 0.05(a; + ?/), r(x,y) = VSx^ + 2xy + 3y2 + 1. 

We smoothed this data matrix Z as described above with annihilators of order k = £ = 2. The es- 
timators ai^K and a2,K turned out to be almost constant and sUghtly smaller than 1 .0 on (0.5, 0.65), 
so we chose a = 1. The first row of Figure 3 shows gray scale images of the raw data Z (left) and 
the true signal M (right). The second and third row depict the matrix M for different values of 
a. Precisely, to show the effect of varying the estimated noise level, we replaced a with ca, where 
c = 0.5 (undersmoothing), c = 1.0 (original estimator), c = 1.5 (oversmoothing) and c = 2.0 
(heavy oversmoothing). In these pictures the gray scale ranges from —7 (black) to 7 (white). 

Figure 4 depicts the transformed squared coefficients Zfj/{1 + Zfj) (left panel) and the bi- 
monotone shrinkage matrix 7 (right panel). 

Figure 5 shows the average squared loss p~^ \\M — Af ||p as a function of a. The emerging 
pattern is very stable over all simulations we looked at. This plot and figure 4 show that there is a 
rather large range of values for a leading to estimators of similar quality. Overestimation of a is 
less severe than underestimation and sometimes even beneficial. 

Since this is just one simulation, we also conducted a simulation study. We generated 5000 
such data matrices Z. Each time we estimated the noise level yia a = an. Then we computed 
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the shrinkage estimators M in (8), where the shrinkage matrices 7 were given by (10) and by 
(9) with T running through a fine grid of points in (0, 2]. It turned out that r = 0.60 yielded 
optimal performance, although this value depends certainly on the underlying signal and noise 
level. Table 3 provides Monte Carlo estimates of the corresponding risk, i.e. the expectation of 
the normalized quadratic loss \\M — -Mlll^. The values in brackets are the estimated standard 
deviations of the latter loss. This table shows that bimonotone shrinkage yields better results than 
componentwise (soft) thresholding. 



bimonotone 
shrinkage (10) 


componen 
r = 0.5 


itwise thres 
T = 0.6 


holding (9) 
T = 1.0 


with 
r = 1.5 


r = 2.0 


0.0790 
(0.0044) 


0.0922 
(0.0050) 


0.0888 
(0.0051) 


0.1044 
(0.0061) 


0.1342 
(0.0073) 


0.1619 
(0.0082) 



Table 3: Estimated risks of different estimators in Example 4.1. 



4.3 Viticultural case study 

In this case study, row i of the data matrix Y G M^^xs j-gports the grape yields harvested in 3 
successive years from a vineyard near Lake Erie that has 52 rows of vines. The data is taken from 
Chatterjee, Handcock, and Simonoff [5]. The grape yields, measured in lugs of grapes harvested 
from each vineyard-row, are plotted in the upper left panel of Figure 6, using a different plotting 
character for each of the three years. The analysis seeks to bring out patterns in the vineyard- 
row yields that persist across years. Year and vineyard-row are both ordinal covariates. The 
covariate vineyard-row summarizes location-dependent effects that may be due to soil fertility 
and microclimate. The covariate year summarizes time-varying effects that may be due to rainfall 
pattern, temperatures, and viticultural practices. 

A preliminary data analysis based on running means and variance estimates from triplets 
{Yij ,Yi^ij ,Yi^2,j), 1 < i < 50, revealed that a square-root transformation yields a data ma- 
trix Z G M^^^^ which may be viewed as a two-way layout in which both the row and column 

numbers are ordinal covariates, the measurement errors are independent with mean zero and com- 
mon unknown variance and unknown mean matrix M = E Z. 

Now we appUed the orthonormal transformation into spline bases with = i and = j, 
where k = 2 and £ = 1. In particular, tti and U2 are proportional to I52 and {i — 26.5)|£^, 
respectively. Similarly, vi, V2 and ^3 are proportional to I3, (—1,0, 1)^ and (1, —2, 1)^, respec- 
tively. The graphs of K I — > o'l K and K I — > d'2 K revealed that a = 0.25 is a plausible estimator 
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for a. The resulting fitted matrix M is shown in the upper right panel of Figure 6, adding linear 
interpolation between adjacent elements to bring out their trend. In addition the transformed data 
Zij are superimposed as single points. 

The estimated mean grape yields reveal shared patterns across the three years. Large dips in 
estimated mean grape yields occur in the outermost rows of the vineyard and near row 33. These 
point to possible geographical variations in growing conditions, such as harsher climate at the 
vineyard edges or changes in soil fertility. 

It is also interesting to split the fit M into an additive part (including constant) and interactions, 

r s 
1=2 j=2 

r s 

-Winter = '^i'"] ■ 

1=2 j=2 

The lower panels of Figure 6 depict these parts separately. The plot of the additive part emphasizes 
the pattern across rows just described and the (nonlinear) increase across years. The interactions 
reveal that a simple additive model doesn't seem appropriate for these data. 
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transf. squared coefficients shrinkage matrix 




lO 20 30 40 50 60 70 80 90 100 10 20 30 40 50 60 70 60 90 100 



Figure 4: Shrinkage estimation: Transformed squared coefficients Zf^liX + Zj^ (left) and bi- 
monotone shrinkage matrix 7 (right). 




J I 1 1 1 1 1 1 1 1 1 1 

0.5 1 1.5 2 2.5 3 3.5 4 4.5 

Figure 5: Shrinkage estimation: Average quadratic loss as a function of a. 
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